Python: Implement message filtering in agent workflow to stop emitting already seen responses#4268
Python: Implement message filtering in agent workflow to stop emitting already seen responses#4268leelakarthik wants to merge 7 commits intomicrosoft:mainfrom
Conversation
97fedac to
85070bc
Compare
@microsoft-github-policy-service agree |
Updated: Added tests and clarified filtering rationaleWhat this fixesIssue #4261: When a GroupChat orchestrator terminates, it yields Root cause analysisThe termination path in # _check_terminate_and_yield / _check_agent_terminate_and_yield
await ctx.yield_output(self._full_conversation) # list[Message]This hits the Fix:
|
|
@copilot review |
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 3 | Confidence: 83%
✗ Correctness
The
_filter_messagesmethod is logically sound for the stated purpose of preventing user-input compounding, but it has a return-type annotation bug (| None) even thoughNoneis never returned—callers use.extend()andfor … inwhich would crash onNone. Additionally, the regression tests yieldAgentResponse(notlist[Message]), so they likely never exercise the_filter_messagescode path that this PR adds, giving false confidence in the fix.
✗ Security Reliability
The new
_filter_messagesmethod declares a return type oflist[Message] | None, but neither call site guards againstNone. While the current implementation never actually returnsNone, the misleading type annotation invites a future change that would crash both callers (messages.extend(None)andfor msg in NoneraiseTypeError). Additionally, after filtering,raw_representations.append(data)still records the original unfiltered list, creating a mismatch between the response messages and their raw representations. Finally, when all messages haverole='user', the filter silently returns an empty list, dropping the workflow output with no log or error.
✗ Test Coverage
The new
_filter_messagesmethod is exercised by the updated happy-path test, but its edge cases (empty list, all-user messages, assistant messages with None/whitespace-only text triggering the fallback path) have no coverage. More critically, the two regression tests inTestWorkflowAgentUserInputFilteringRegressionyieldAgentResponseobjects viaWorkflowContext[Never, AgentResponse], notlist[Message]; since_filter_messagesis only invoked on theis_instance_of(data, list[Message])branch, these regression tests likely never exercise the new filtering logic and therefore don't guard against the regression they describe. The return type annotationlist[Message] | Noneis also misleading because the method never returnsNone, and callers pass the result directly tolist.extend()which would raiseTypeErroronNone.
Blocking Issues
- The return type of
_filter_messagesis declared aslist[Message] | None, but the function never returnsNone. Callers at lines 487 and 604 call.extend(chat_messages)andfor msg in chat_messages:respectively, which would crash onNone. More importantly, a static type-checker will flag these call-sites as unsafe. The| Noneshould be removed from the annotation. - The regression tests (
TestWorkflowAgentUserInputFilteringRegression) useWorkflowContext[Never, AgentResponse]and yield anAgentResponseobject—not alist[Message]. The_filter_messagescall is only reached inside theis_instance_of(data, list[Message])branch, so these tests never exercise the new filtering logic. They may pass regardless of whether_filter_messagesexists, providing no actual regression coverage for the fix. - _filter_messages declares return type
list[Message] | Nonebut both callers use the result directly inextend()andforiteration without a None check. If the method is ever modified to return None (as the type hint permits), both call sites will raise TypeError at runtime. - The regression tests (
test_streaming_compounding_not_observed_across_turns,test_nonstreaming_compounding_not_observed_across_turns) useWorkflowContext[Never, AgentResponse]and yield anAgentResponseobject._filter_messagesis only called whenis_instance_of(data, list[Message])is true, so these tests almost certainly do not exercise the new filtering code path. They should yieldlist[Message](matching the existing test pattern) to actually cover the fix. - The fallback path of
_filter_messages(where no non-user message has non-empty text) is completely untested. When all messages are user messages, the method returns an empty list, silently dropping all output — this behaviour should be tested and verified as intentional.
Suggestions
- Consider what should happen when every message in the list has
role == "user"(e.g., an echo workflow). The fallback returns[], silently dropping all output. It may be safer to return the original list unchanged in that edge case so no data is silently lost. - In
_convert_workflow_events_to_agent_response(line 488),raw_representations.append(data)still appends the unfiltered original list whilemessages.extend(chat_messages)uses the filtered version. Document that this asymmetry is intentional, or filterraw_representationstoo, to avoid confusing downstream consumers. - Consider logging a warning inside
_filter_messageswhen the returned list is empty (all messages were user-role), so silent data loss is diagnosable. - At line 488,
raw_representations.append(data)still records the original unfiltered list whilemessagescontains the filtered subset—verify this mismatch is intentional and won't confuse downstream consumers that correlate messages with raw_representations. - Add direct unit tests for
_filter_messagescovering: empty input list, single assistant message, all-user messages, assistant message withtext=None, assistant message with whitespace-only text, and mixed roles where the last non-user message has no text (fallback path). - Fix the return type annotation from
list[Message] | Nonetolist[Message]since the method never returnsNoneand callers pass the result tolist.extend()which would TypeError onNone.
Automated review by moonbox3's agents
…ng messages in response Add a method to filter messages and update message handling. This helps remove the user messages being duplicated/emitted in the response This fixes the microsoft#4261 for both streaming and non streaming functions
…ge] filtering expectations (microsoft#4261) - Update test_workflow_as_agent_yield_output_with_list_of_chat_messages to reflect _filter_messages semantics: only the last meaningful assistant message is surfaced from list[Message] output, preventing user input re-emission and full conversation history replay. - Add TestWorkflowAgentUserInputFilteringRegression with multi-turn compounding regression tests for both streaming and non-streaming paths, reproducing the exact escalating symptom from microsoft#4261. Users who need intermediate agent responses can opt in via intermediate_outputs=True in GroupChatBuilder.
Co-authored-by: Evan Mattson <35585003+moonbox3@users.noreply.github.com>
…age and add unit tests
3a6c46d to
3debdb4
Compare
|
@moonbox3 All review feedback addressed — tests added, inline comment added for raw_representations asymmetry, and branch rebased on latest main. Ready for another look when you get a chance. |
Summary
Fixes #4261
When a GroupChat orchestrator terminates, it yields
self._full_conversation(the entire conversation history) via
ctx.yield_output(). Without filtering,WorkflowAgent converts all messages — including user inputs and earlier assistant
responses — into output, causing them to compound across successive turns.
Root Cause
The termination path in
BaseGroupChatOrchestratorcallsctx.yield_output(self._full_conversation)which is alist[Message]. This hitsthe
list[Message]branch in both_convert_workflow_event_to_agent_response_updates(streaming) and
_convert_workflow_events_to_agent_response(non-streaming), whereall messages were forwarded without filtering.
Fix
Added
_filter_messages()that returns only the last meaningful assistant messagefrom
list[Message]output. This is intentionally aggressive because:including them again causes duplication
GroupChatBuilderdefaults tointermediate_outputs=False. Users who need intermediate responses can opt invia
intermediate_outputs=TrueRelationship to #4275
PR #4275 filters the
AgentResponsebranch. This PR filters thelist[Message]branch — the actual code path hit by GroupChat's termination yield. Both fixes are
complementary.
Tests
test_workflow_as_agent_yield_output_with_list_of_chat_messagestoreflect new filtering semantics
TestWorkflowAgentUserInputFilteringRegressionwith multi-turn compoundingregression tests for streaming and non-streaming paths
Contribution Checklist